Care Bank ran a campaign for term-deposit subscriptions last year among its existing customers and saw a healthy conversion rate of over 12%. The bank is interested in term-deposit subscriptions because a term deposit earns it better returns than a savings account: the customer cannot access the money before maturity unless they are willing to compensate the bank, so the bank can invest that money in other markets for better returns. The bank now plans to launch a new campaign, and this time it wants to use the data available from previous campaigns and to automate the process with better-targeted marketing, increasing the success ratio on a minimal budget.
The objective of this project is to build a model that will help the marketing department identify, in the next campaign, the customers who have a higher probability of subscribing to the term deposit. This should increase the success ratio while reducing the cost of the campaign.
# To help with reading and manipulation of data
import numpy as np
import pandas as pd
import pprint
from scipy import stats
from scipy.stats import zscore
# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
# To split the data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# to use standard scaler
from sklearn.preprocessing import StandardScaler
# To impute missing values
from sklearn.impute import SimpleImputer
# To build tree-based and ensemble classifiers
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
# To tune a model
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# To get different performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    recall_score,
    accuracy_score,
    precision_score,
    f1_score,
)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
main_data = pd.read_csv('Term Deposits Subscription Prediction - DATASET.csv')
TARGET_COLUMN = 'Target'
main_data.head(10)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 5 | 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 6 | 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 7 | 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 8 | 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 9 | 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
default_color_palette = ["#f44336", "#e81e63", "#9c27b0", "#673ab7", "#3f51b5", "#2196f3", "#03a9f4", "#00bcd4","#009688","#4caf50","#8bc34a","#cddc39","#ffeb3b","#ffc107","#ff9800", "#ff5722","#795548","#9e9e9e","#607d8b","#03A9F4","#7C4DFF", "#FF5252", "#D50000", "#FF6F00", "#0288D1", "#7C4DFF"]
maritalstatus_color_palette = {'single':'#ff3d00', 'married':'#00c853', 'divorced': '#EA112F'}
stats_colors = {'Mean':'#D50000', 'Mode':'#FF3D00', 'Median':'#2962FF'}
'''
Input:
    Target type, a list of feature names and the DataFrame to modify.
Output:
    Converts every feature listed in 'column_names' to the type given in 'toType'.
Returns:
    Nothing; modifies the DataFrame passed in 'df' in place.
'''
def ConvertColTo(toType = "category", column_names=np.nan, df = np.nan):
    for col_name in column_names:
        df[col_name] = df[col_name].astype(toType)
'''
Input:
    Pandas DataFrame
Output:
    Displays the DataFrame structure
    (columns, null and non-null counts, and null percentage, highlighting the columns with the most nulls)
Returns:
    N/A
'''
def info(dataFrame):
    print(f"{dataFrame.shape[0]} Rows x {dataFrame.shape[1]} Columns")
    nulls_series = dataFrame.isna().sum() # Get a series counting the number of empty values for each column
    nonnulls_series = dataFrame.notnull().sum() # Get a series counting the number of non-empty values for each column
    nulls_percentage = ((nulls_series * 100)/(nulls_series + nonnulls_series)).astype(float)
    column_datatypes = dataFrame.dtypes # Get a series containing data types for each column
    series_arr = [nulls_series, nonnulls_series, nulls_percentage, column_datatypes]
    col_names_arr = ["Nulls", "Non-Nulls","Nulls %", "Type"]
    nulls_count_df = pd.concat(
        objs = series_arr,
        axis = 1,
        keys = col_names_arr,
        sort = True)
    cm = sns.light_palette("red", as_cmap=True)
    display(nulls_count_df.style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['Nulls %']]).format(formatter={('Nulls %'): "{:.2f}%"}))
unknown:
poutcome has about 80% missing values, so we have to drop it.
We might need to drop the contact column too (it has 28% missing values, that is 13020 rows); we will check the correlation first, then drop it if it is too low.
education and job will be filled with the mode of that column.
info(main_data)
45211 Rows x 17 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45211 | 0.00% | object |
| age | 0 | 45211 | 0.00% | int64 |
| balance | 0 | 45211 | 0.00% | int64 |
| campaign | 0 | 45211 | 0.00% | int64 |
| contact | 0 | 45211 | 0.00% | object |
| day | 0 | 45211 | 0.00% | int64 |
| default | 0 | 45211 | 0.00% | object |
| duration | 0 | 45211 | 0.00% | int64 |
| education | 0 | 45211 | 0.00% | object |
| housing | 0 | 45211 | 0.00% | object |
| job | 0 | 45211 | 0.00% | object |
| loan | 0 | 45211 | 0.00% | object |
| marital | 0 | 45211 | 0.00% | object |
| month | 0 | 45211 | 0.00% | object |
| pdays | 0 | 45211 | 0.00% | int64 |
| poutcome | 0 | 45211 | 0.00% | object |
| previous | 0 | 45211 | 0.00% | int64 |
main_data = main_data.replace('unknown', np.nan)
info(main_data)
45211 Rows x 17 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45211 | 0.00% | object |
| age | 0 | 45211 | 0.00% | int64 |
| balance | 0 | 45211 | 0.00% | int64 |
| campaign | 0 | 45211 | 0.00% | int64 |
| contact | 13020 | 32191 | 28.80% | object |
| day | 0 | 45211 | 0.00% | int64 |
| default | 0 | 45211 | 0.00% | object |
| duration | 0 | 45211 | 0.00% | int64 |
| education | 1857 | 43354 | 4.11% | object |
| housing | 0 | 45211 | 0.00% | object |
| job | 288 | 44923 | 0.64% | object |
| loan | 0 | 45211 | 0.00% | object |
| marital | 0 | 45211 | 0.00% | object |
| month | 0 | 45211 | 0.00% | object |
| pdays | 0 | 45211 | 0.00% | int64 |
| poutcome | 36959 | 8252 | 81.75% | object |
| previous | 0 | 45211 | 0.00% | int64 |
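
Before deciding whether a column with many missing values, such as contact, is worth keeping, it can help to look at how the target rate differs across its levels, with "missing" treated as a level of its own. The following is a minimal sketch on made-up toy data, not the notebook's actual check; the `toy` frame and the `"missing"` label are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data standing in for main_data; the values are invented for illustration.
toy = pd.DataFrame({
    "contact": ["cellular", "cellular", "telephone", None, None, "cellular", None, "telephone"],
    "Target":  ["yes", "no", "no", "no", "no", "yes", "no", "yes"],
})

# Treat a missing contact type as its own level so it appears in the table.
contact = toy["contact"].fillna("missing")

# Row-normalised crosstab: share of 'no' and 'yes' within each contact level.
rates = pd.crosstab(contact, toy["Target"], normalize="index")
print(rates)
```

If the "yes" share is very different between levels, the column probably carries signal and dropping it would lose information.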
main_data.drop("poutcome", axis=1, inplace=True)
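# Fill job and education with their most frequent value; assigning the result back avoids chained in-place modification
main_data['job'] = main_data['job'].fillna(main_data['job'].value_counts().idxmax())
main_data['education'] = main_data['education'].fillna(main_data['education'].value_counts().idxmax())
# Keep missing contact types as an explicit "unknown" level
main_data['contact'] = main_data['contact'].fillna("unknown")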
info(main_data)
45211 Rows x 16 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45211 | 0.00% | object |
| age | 0 | 45211 | 0.00% | int64 |
| balance | 0 | 45211 | 0.00% | int64 |
| campaign | 0 | 45211 | 0.00% | int64 |
| contact | 0 | 45211 | 0.00% | object |
| day | 0 | 45211 | 0.00% | int64 |
| default | 0 | 45211 | 0.00% | object |
| duration | 0 | 45211 | 0.00% | int64 |
| education | 0 | 45211 | 0.00% | object |
| housing | 0 | 45211 | 0.00% | object |
| job | 0 | 45211 | 0.00% | object |
| loan | 0 | 45211 | 0.00% | object |
| marital | 0 | 45211 | 0.00% | object |
| month | 0 | 45211 | 0.00% | object |
| pdays | 0 | 45211 | 0.00% | int64 |
| previous | 0 | 45211 | 0.00% | int64 |
Update Data Types:
catgry_col_names = main_data.select_dtypes(include=['object']).columns.tolist()
ConvertColTo('category', catgry_col_names, main_data)
info(main_data)
45211 Rows x 16 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45211 | 0.00% | category |
| age | 0 | 45211 | 0.00% | int64 |
| balance | 0 | 45211 | 0.00% | int64 |
| campaign | 0 | 45211 | 0.00% | int64 |
| contact | 0 | 45211 | 0.00% | category |
| day | 0 | 45211 | 0.00% | int64 |
| default | 0 | 45211 | 0.00% | category |
| duration | 0 | 45211 | 0.00% | int64 |
| education | 0 | 45211 | 0.00% | category |
| housing | 0 | 45211 | 0.00% | category |
| job | 0 | 45211 | 0.00% | category |
| loan | 0 | 45211 | 0.00% | category |
| marital | 0 | 45211 | 0.00% | category |
| month | 0 | 45211 | 0.00% | category |
| pdays | 0 | 45211 | 0.00% | int64 |
| previous | 0 | 45211 | 0.00% | int64 |
EDA:
'''
Input:
    A list of categorical column names
Output:
    Goes through each categorical column and prints the count and percentage of every unique value in that column.
Returns:
    N/A
'''
def CountUniqueValues(col_names):
    for col_name in col_names:
        print(f"======================='{col_name}'==================")
        total_count = main_data[col_name].count()
        for unique_col_value in main_data[col_name].unique().tolist():
            unique_values_count = main_data[main_data[col_name] == unique_col_value][col_name].count()
            percentage = str(round((unique_values_count/total_count) * 100, 2))
            print(f"{unique_col_value} \t: {unique_values_count} ({percentage}%)")
        print(f"=========================================================\n")
'''
Description:
    Displays a grid of count plots, two per row.
Input:
    A list of column names, the column to use as hue, and a color palette
'''
def DisplayCountPlotGrid(col_names, hue_name, color_palette):
    col_index = 0
    for r in range(0, int(len(col_names)), 2):
        fig, axs = plt.subplots(
            nrows=1, # Number of rows of the grid
            ncols=2, # Number of columns of the grid.
            figsize=(15,4),
            constrained_layout=True)
        for index in range(0, 2):
            if col_index < int(len(col_names)):
                column_name = col_names[col_index]
                ax = axs.flat[index]
                ax = sns.countplot(
                    data=main_data,
                    x=main_data[column_name],
                    palette=color_palette,
                    hue=hue_name,
                    ax = ax)
                ax.set_xlabel(column_name)
                ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
                ax.set_title(column_name + ' and '+ hue_name +' Profile', fontsize=14)
                if int(len(ax.get_xticklabels())) > 14:
                    ax.set_xticklabels([], rotation=45, ha='right')
                else:
                    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
            col_index += 1
'''
Input:
    A list of column names, the column to use as hue, and a color palette
Output:
    Plots one horizontal box plot per column, stacked vertically.
Returns:
    N/A
'''
def PlotLineOfBoxPlots(col_names, hue_name, color_palette):
    fig, axs = plt.subplots(
        len(col_names),
        figsize = (15,12),
        sharex = False,
        sharey = False)
    fig.subplots_adjust(top = 4)
    for i in range(len(col_names)):
        column_name = col_names[i]
        sns.boxplot(
            data = main_data,
            x = column_name,
            y = TARGET_COLUMN,
            hue = hue_name,
            orient = "h",
            palette = color_palette,
            ax = axs[i])
        axs[i].set_xlabel(column_name)
        axs[i].legend(loc='upper right')
        axs[i].set_title(column_name + ' Profile', fontsize=14)
descrete_data_columns = main_data.select_dtypes(include=['float', 'int64', 'uint8']).columns.tolist()
descrete_data_columns
['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
catgry_col_names
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'Target']
CountUniqueValues(catgry_col_names)
======================='job'==================
management : 9458 (20.92%)
technician : 7597 (16.8%)
entrepreneur : 1487 (3.29%)
blue-collar : 10020 (22.16%)
retired : 2264 (5.01%)
admin. : 5171 (11.44%)
services : 4154 (9.19%)
self-employed : 1579 (3.49%)
unemployed : 1303 (2.88%)
housemaid : 1240 (2.74%)
student : 938 (2.07%)
=========================================================

======================='marital'==================
married : 27214 (60.19%)
single : 12790 (28.29%)
divorced : 5207 (11.52%)
=========================================================

======================='education'==================
tertiary : 13301 (29.42%)
secondary : 25059 (55.43%)
primary : 6851 (15.15%)
=========================================================

======================='default'==================
no : 44396 (98.2%)
yes : 815 (1.8%)
=========================================================

======================='housing'==================
yes : 25130 (55.58%)
no : 20081 (44.42%)
=========================================================

======================='loan'==================
no : 37967 (83.98%)
yes : 7244 (16.02%)
=========================================================

======================='contact'==================
unknown : 13020 (28.8%)
cellular : 29285 (64.77%)
telephone : 2906 (6.43%)
=========================================================

======================='month'==================
may : 13766 (30.45%)
jun : 5341 (11.81%)
jul : 6895 (15.25%)
aug : 6247 (13.82%)
oct : 738 (1.63%)
nov : 3970 (8.78%)
dec : 214 (0.47%)
jan : 1403 (3.1%)
feb : 2649 (5.86%)
mar : 477 (1.06%)
apr : 2932 (6.49%)
sep : 579 (1.28%)
=========================================================

======================='Target'==================
no : 39922 (88.3%)
yes : 5289 (11.7%)
=========================================================
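
The per-level counts and percentages printed above can also be obtained with pandas' built-in `value_counts`. The following is a small sketch on toy data; `summarize_levels` and `toy_marital` are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def summarize_levels(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage of each distinct non-null value."""
    counts = series.value_counts()                    # non-null values only
    percent = (counts / counts.sum() * 100).round(2)  # share of non-null rows
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy column standing in for main_data['marital'].
toy_marital = pd.Series(["married", "single", "married", "divorced", "married"])
summary = summarize_levels(toy_marital)
print(summary)
```

Returning a DataFrame rather than printing makes the summary easy to sort, filter or plot afterwards.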
Job:
category = 'job'
CountUniqueValues([category])
======================='job'==================
management : 9458 (20.92%)
technician : 7597 (16.8%)
entrepreneur : 1487 (3.29%)
blue-collar : 10020 (22.16%)
retired : 2264 (5.01%)
admin. : 5171 (11.44%)
services : 4154 (9.19%)
self-employed : 1579 (3.49%)
unemployed : 1303 (2.88%)
housemaid : 1240 (2.74%)
student : 938 (2.07%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
Marital Status:
category = 'marital'
CountUniqueValues([category])
======================='marital'==================
married : 27214 (60.19%)
single : 12790 (28.29%)
divorced : 5207 (11.52%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = maritalstatus_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = maritalstatus_color_palette)
Education Status:
category = 'education'
CountUniqueValues([category])
======================='education'==================
tertiary : 13301 (29.42%)
secondary : 25059 (55.43%)
primary : 6851 (15.15%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
Housing Status:
category = 'housing'
CountUniqueValues([category])
======================='housing'==================
yes : 25130 (55.58%)
no : 20081 (44.42%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
default Status:
category = 'default'
CountUniqueValues([category])
======================='default'==================
no : 44396 (98.2%)
yes : 815 (1.8%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
loan Status:
category = 'loan'
CountUniqueValues([category])
======================='loan'==================
no : 37967 (83.98%)
yes : 7244 (16.02%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
contact Status:
category = 'contact'
CountUniqueValues([category])
======================='contact'==================
unknown : 13020 (28.8%)
cellular : 29285 (64.77%)
telephone : 2906 (6.43%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
month Status:
category = 'month'
CountUniqueValues([category])
======================='month'==================
may : 13766 (30.45%)
jun : 5341 (11.81%)
jul : 6895 (15.25%)
aug : 6247 (13.82%)
oct : 738 (1.63%)
nov : 3970 (8.78%)
dec : 214 (0.47%)
jan : 1403 (3.1%)
feb : 2649 (5.86%)
mar : 477 (1.06%)
apr : 2932 (6.49%)
sep : 579 (1.28%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
Target Status:
category = 'Target'
CountUniqueValues([category])
======================='Target'==================
no : 39922 (88.3%)
yes : 5289 (11.7%)
=========================================================
DisplayCountPlotGrid(
col_names = catgry_col_names,
hue_name = category,
color_palette = default_color_palette)
PlotLineOfBoxPlots(
col_names = descrete_data_columns,
hue_name = category,
color_palette = default_color_palette)
'''
Input:
    Axes for the box plot and for the histogram, column name (x-axis), DataFrame
Output:
    Displays a box plot and a histogram of the column, with mean, median and mode lines on the histogram.
Returns:
    N/A
'''
def HistBoxplot(box_chart_ax, hist_chart_ax, x_axis, df):
    sns.boxplot(
        data=df,
        x=df[x_axis],
        showmeans=True,
        ax=box_chart_ax)
    sns.histplot(
        data=df,
        x=df[x_axis],
        kde=True,
        ax=hist_chart_ax)
    hist_chart_ax.axvline(df[x_axis].mean(), # Draw a vertical line at the mean of the given column
        color=stats_colors['Mean'], # Use one of the colors predefined in this notebook
        label='Mean', # Set the label to be displayed in the legend
        linestyle="dashed") # Make the line dashed
    hist_chart_ax.axvline(df[x_axis].median(), # Plot the median line on the chart.
        color=stats_colors['Median'], # Use one of the colors predefined in this notebook
        label='Median', # Set the label to be displayed in the legend
        linestyle="dashed") # Make the line dashed
    hist_chart_ax.axvline(df[x_axis].mode()[0], # Plot the mode line on the chart.
        color=stats_colors['Mode'], # Use one of the colors predefined in this notebook
        label='Mode', # Set the label to be displayed in the legend
        linestyle="dashed") # Make the line dashed
    hist_chart_ax.legend(loc='upper right')
'''
Input:
    Pandas DataFrame
Output:
    Displays a grid of [box plot x distribution chart] for the numeric features, three per row.
Returns:
    N/A
'''
def PlotHistBoxGrid(df):
    col_names = df.select_dtypes(include=['float', 'int64', 'uint8']).columns.tolist()
    print(col_names)
    col_index = 0
    for r in range(0, int(len(col_names)), 3):
        fig, (box, hist) = plt.subplots(
            nrows=2, # Number of rows of the grid
            ncols=3, # Number of columns of the grid.
            figsize=(15,4),
            gridspec_kw={"height_ratios" : (0.25,0.5)},
            constrained_layout=True)
        for index in range(0, 3):
            if col_index < int(len(col_names)):
                HistBoxplot(box.flat[index], hist.flat[index], col_names[col_index], df)
            col_index += 1
'''
Input:
    Values of a single column
Output:
    The first quartile, third quartile and interquartile range of the values
Returns:
    A dictionary containing Q1, Q3 and IQR
'''
def Get_IQR(data):
    quartiles = np.quantile(data, [.25, .75])
    iqr = (quartiles[1] - quartiles[0])
    return {
        "Q1": quartiles[0],
        "Q3": quartiles[1],
        "IQR": iqr
    }
def RemoveOutliers(df):
    main_data_copy = df.copy()
    col_names = main_data_copy.select_dtypes(include=['float', 'int64']).columns.tolist()
    if TARGET_COLUMN in col_names: # The target is only numeric once it has been encoded
        col_names.remove(TARGET_COLUMN)
    for col_name in col_names:
        quatiles = Get_IQR(main_data_copy[col_name])
        scale = 1.5
        lower_quatile = quatiles["Q1"] - scale * quatiles["IQR"]
        upper_quatile = quatiles["Q3"] + scale * quatiles["IQR"]
        mode_value = main_data_copy[col_name].mode()[0] # mode() returns a Series; take its first value
        # Replace values below the lower fence with the column mode
        main_data_copy[col_name] = np.where(
            main_data_copy[col_name] < lower_quatile,
            mode_value,
            main_data_copy[col_name])
        # Replace values above the upper fence with the column mode
        main_data_copy[col_name] = np.where(
            main_data_copy[col_name] > upper_quatile,
            mode_value,
            main_data_copy[col_name])
    return main_data_copy
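
As an alternative to replacing out-of-range values with the mode, values can be clipped to the IQR fences, which keeps the ordering of the extreme observations. This is a minimal sketch under the same 1.5 x IQR rule; `iqr_fences` and `toy_campaign` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def iqr_fences(values: pd.Series, scale: float = 1.5):
    """Return the (lower, upper) fences Q1 - scale*IQR and Q3 + scale*IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

# Toy column with one extreme value, standing in for a numeric feature.
toy_campaign = pd.Series([1, 1, 2, 2, 3, 3, 4, 63], dtype=float)

lower, upper = iqr_fences(toy_campaign)
# Clip values outside the fences to the nearest fence.
clipped = toy_campaign.clip(lower=lower, upper=upper)
print(lower, upper)
print(clipped.tolist())
```

Whether clipping, mode replacement or row removal is appropriate depends on whether the extreme values are errors or genuine observations.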
PlotHistBoxGrid(main_data)
['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
Balance Outlier Treatment:
main_data[main_data['balance'] > 80000]
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26227 | 59 | management | married | tertiary | no | 98417 | no | no | telephone | 20 | nov | 145 | 5 | -1 | 0 | no |
| 39989 | 51 | management | single | tertiary | no | 102127 | no | no | cellular | 3 | jun | 90 | 1 | -1 | 0 | no |
| 42558 | 84 | retired | married | secondary | no | 81204 | no | no | telephone | 28 | dec | 679 | 1 | 313 | 2 | yes |
| 43393 | 84 | retired | married | secondary | no | 81204 | no | no | telephone | 1 | apr | 390 | 1 | 94 | 3 | yes |
main_data = main_data[main_data['balance'] < 80000]
main_data[main_data['balance'] > 80000]
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
duration Outlier Treatment:
main_data[main_data['duration'] > 3500]
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9947 | 59 | management | married | secondary | no | 1321 | no | no | unknown | 9 | jun | 3881 | 3 | -1 | 0 | yes |
| 24148 | 59 | technician | married | tertiary | no | 6573 | yes | no | telephone | 10 | nov | 4918 | 1 | -1 | 0 | no |
| 44602 | 45 | services | single | secondary | no | 955 | no | no | unknown | 27 | aug | 3785 | 1 | -1 | 0 | no |
main_data = main_data[main_data['duration'] < 3500]
main_data[main_data['duration'] > 3500]
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
campaign Outlier Treatment:
main_data[main_data['campaign'] >= 50]
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4299 | 30 | management | single | tertiary | no | 358 | yes | no | unknown | 19 | may | 88 | 51 | -1 | 0 | no |
| 4330 | 45 | management | married | secondary | no | 9051 | yes | no | unknown | 19 | may | 124 | 63 | -1 | 0 | no |
| 5073 | 35 | technician | married | secondary | no | 432 | yes | no | unknown | 21 | may | 1094 | 55 | -1 | 0 | no |
| 5459 | 35 | blue-collar | married | secondary | no | 430 | yes | no | unknown | 23 | may | 147 | 50 | -1 | 0 | no |
| 11914 | 24 | technician | single | primary | no | 126 | yes | no | unknown | 20 | jun | 10 | 58 | -1 | 0 | no |
| 18713 | 35 | blue-collar | married | secondary | no | 280 | yes | yes | cellular | 31 | jul | 65 | 50 | -1 | 0 | no |
main_data = main_data[main_data['campaign'] < 50]
main_data[main_data['campaign'] >= 50]
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
previous Outlier Treatment:
main_data[main_data['previous'] >= 250]
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29182 | 40 | management | married | tertiary | no | 543 | yes | no | cellular | 2 | feb | 349 | 2 | 262 | 275 | no |
main_data = main_data[main_data['previous'] < 250]
main_data[main_data['previous'] >= 250]
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PlotHistBoxGrid(main_data)
['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
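# Log-transform the skewed columns. log(x + 1) is undefined for x < -1 and is -inf for x = -1,
# so negative balances produce NaN or -inf here.
main_data['duration_log'] = (main_data['duration']+1).transform(np.log)
main_data['balance_Log'] = (main_data['balance'] + 1).transform(np.log)
main_data.drop([
    'duration',
    'balance',
    ],
    axis=1,
    inplace=True)
main_data.replace([np.inf, -np.inf], np.nan, inplace=True)
# Note: .mode() returns a Series, so fillna() aligns on index labels instead of filling every NaN.
# That is why balance_Log still reports nulls in the summary below.
main_data['duration_log'].fillna(main_data['duration_log'].mode(), inplace=True)
main_data['balance_Log'].fillna(main_data['balance_Log'].mode(), inplace=True)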
info(main_data)
45197 Rows x 16 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45197 | 0.00% | category |
| age | 0 | 45197 | 0.00% | int64 |
| balance_Log | 3766 | 41431 | 8.33% | float64 |
| campaign | 0 | 45197 | 0.00% | int64 |
| contact | 0 | 45197 | 0.00% | category |
| day | 0 | 45197 | 0.00% | int64 |
| default | 0 | 45197 | 0.00% | category |
| duration_log | 0 | 45197 | 0.00% | float64 |
| education | 0 | 45197 | 0.00% | category |
| housing | 0 | 45197 | 0.00% | category |
| job | 0 | 45197 | 0.00% | category |
| loan | 0 | 45197 | 0.00% | category |
| marital | 0 | 45197 | 0.00% | category |
| month | 0 | 45197 | 0.00% | category |
| pdays | 0 | 45197 | 0.00% | int64 |
| previous | 0 | 45197 | 0.00% | int64 |
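
The summary above still shows 3766 nulls in balance_Log after the `fillna` call. A likely explanation is that `Series.mode()` returns a Series rather than a scalar, and `fillna` with a Series aligns on index labels. The toy example below illustrates the difference; the series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 2.0, np.nan, 5.0])

# mode() returns a Series indexed from 0; here it holds the single value 2.0 at label 0.
mode_series = s.mode()

# Passing the Series aligns on labels: only a NaN at label 0 could be filled.
aligned = s.fillna(mode_series)

# Passing the scalar fills every NaN.
scalar = s.fillna(mode_series[0])

print(aligned.isna().sum(), scalar.isna().sum())
```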
main_data[main_data['balance_Log'] == np.inf]
| age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
main_data[main_data['duration_log'] == np.inf]
| age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
main_data[main_data['duration_log'] == -np.inf]
| age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
main_data[main_data['balance_Log'] == -np.inf]
| age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
main_data['balance_Log'].fillna(0, inplace=True)
info(main_data)
45197 Rows x 16 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45197 | 0.00% | category |
| age | 0 | 45197 | 0.00% | int64 |
| balance_Log | 0 | 45197 | 0.00% | float64 |
| campaign | 0 | 45197 | 0.00% | int64 |
| contact | 0 | 45197 | 0.00% | category |
| day | 0 | 45197 | 0.00% | int64 |
| default | 0 | 45197 | 0.00% | category |
| duration_log | 0 | 45197 | 0.00% | float64 |
| education | 0 | 45197 | 0.00% | category |
| housing | 0 | 45197 | 0.00% | category |
| job | 0 | 45197 | 0.00% | category |
| loan | 0 | 45197 | 0.00% | category |
| marital | 0 | 45197 | 0.00% | category |
| month | 0 | 45197 | 0.00% | category |
| pdays | 0 | 45197 | 0.00% | int64 |
| previous | 0 | 45197 | 0.00% | int64 |
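
Filling the undefined log values with 0 makes negative balances indistinguishable from a balance of exactly 0, since log(0 + 1) = 0. One possible alternative, shown here only as a sketch and not as the notebook's method, is a sign-preserving log transform that is defined for every real value. `signed_log1p` and `toy_balance` are hypothetical names.

```python
import numpy as np
import pandas as pd

def signed_log1p(values: pd.Series) -> pd.Series:
    """Apply sign(x) * log(1 + |x|), which is finite and monotonic for all real x."""
    return np.sign(values) * np.log1p(np.abs(values))

# Toy balances including negative values, which plain log(x + 1) cannot handle.
toy_balance = pd.Series([-500.0, -1.0, 0.0, 1.0, 2143.0])
transformed = signed_log1p(toy_balance)
print(transformed.round(4).tolist())
```

The transform maps 0 to 0 and keeps the ordering of the original values, so no imputation step is needed afterwards.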
PlotHistBoxGrid(main_data)
['age', 'day', 'campaign', 'pdays', 'previous', 'duration_log', 'balance_Log']
replace_struct = {
"Target" : {"no": 0, "yes": 1},
}
main_data = main_data.replace(replace_struct)
main_data.head(10)
| | age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.568345 | 7.670429 |
| 1 | 44 | technician | single | secondary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.023881 | 3.401197 |
| 2 | 33 | entrepreneur | married | secondary | no | yes | yes | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.343805 | 1.098612 |
| 3 | 47 | blue-collar | married | secondary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.532599 | 7.317876 |
| 4 | 33 | blue-collar | single | secondary | no | no | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.293305 | 0.693147 |
| 5 | 35 | management | married | tertiary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.941642 | 5.446737 |
| 6 | 28 | management | single | tertiary | no | yes | yes | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.384495 | 6.104793 |
| 7 | 42 | entrepreneur | divorced | tertiary | yes | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.942799 | 1.098612 |
| 8 | 58 | retired | married | primary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 3.931826 | 4.804021 |
| 9 | 43 | technician | single | secondary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.025352 | 6.386879 |
sns.pairplot(main_data, kind="reg", hue=TARGET_COLUMN)
<seaborn.axisgrid.PairGrid at 0x7ff3aeb2ef10>
fig, ax = plt.subplots(figsize=(30,30))
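# Restrict the correlation matrix to numeric columns; recent pandas versions raise an error on non-numeric columns
sns.heatmap(data=main_data.select_dtypes(include='number').corr(), annot=True, linewidths=.5, ax=ax)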
plt.show()
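
Alongside the full heatmap, it is often easier to read the correlation of each numeric feature with the target as a single sorted column. A minimal sketch on toy data follows; the `toy` frame and its values are invented, and the target is assumed to be already encoded as 0/1.

```python
import pandas as pd

# Toy frame with numeric features, one text column and an encoded target.
toy = pd.DataFrame({
    "age": [58, 44, 33, 47, 33, 35],
    "campaign": [1, 1, 2, 3, 1, 2],
    "job": ["management", "technician", "entrepreneur", "blue-collar", "blue-collar", "management"],
    "Target": [0, 0, 1, 0, 1, 1],
})

# Keep numeric columns only, then take the Target column of the correlation matrix.
numeric = toy.select_dtypes(include="number")
target_corr = (
    numeric.corr()["Target"]
    .drop("Target")                            # drop the trivial self-correlation
    .sort_values(key=abs, ascending=False)     # strongest relationships first
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here does not by itself show that a feature is uninformative.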
main_data.head(10)
| | age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.568345 | 7.670429 |
| 1 | 44 | technician | single | secondary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.023881 | 3.401197 |
| 2 | 33 | entrepreneur | married | secondary | no | yes | yes | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.343805 | 1.098612 |
| 3 | 47 | blue-collar | married | secondary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.532599 | 7.317876 |
| 4 | 33 | blue-collar | single | secondary | no | no | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.293305 | 0.693147 |
| 5 | 35 | management | married | tertiary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.941642 | 5.446737 |
| 6 | 28 | management | single | tertiary | no | yes | yes | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.384495 | 6.104793 |
| 7 | 42 | entrepreneur | divorced | tertiary | yes | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 5.942799 | 1.098612 |
| 8 | 58 | retired | married | primary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 3.931826 | 4.804021 |
| 9 | 43 | technician | single | secondary | no | yes | no | unknown | 5 | may | 1 | -1 | 0 | 0 | 4.025352 | 6.386879 |
CountUniqueValues(catgry_col_names)
======================='job'==================
management : 9452 (20.91%)
technician : 7594 (16.8%)
entrepreneur : 1487 (3.29%)
blue-collar : 10018 (22.17%)
retired : 2262 (5.0%)
admin. : 5171 (11.44%)
services : 4153 (9.19%)
self-employed : 1579 (3.49%)
unemployed : 1303 (2.88%)
housemaid : 1240 (2.74%)
student : 938 (2.08%)
=========================================================

======================='marital'==================
married : 27204 (60.19%)
single : 12786 (28.29%)
divorced : 5207 (11.52%)
=========================================================

======================='education'==================
tertiary : 13296 (29.42%)
secondary : 25051 (55.43%)
primary : 6850 (15.16%)
=========================================================

======================='default'==================
no : 44382 (98.2%)
yes : 815 (1.8%)
=========================================================

======================='housing'==================
yes : 25122 (55.58%)
no : 20075 (44.42%)
=========================================================

======================='loan'==================
no : 37954 (83.97%)
yes : 7243 (16.03%)
=========================================================

======================='contact'==================
unknown : 13013 (28.79%)
cellular : 29282 (64.79%)
telephone : 2902 (6.42%)
=========================================================

======================='month'==================
may : 13762 (30.45%)
jun : 5338 (11.81%)
jul : 6894 (15.25%)
aug : 6246 (13.82%)
oct : 738 (1.63%)
nov : 3968 (8.78%)
dec : 213 (0.47%)
jan : 1403 (3.1%)
feb : 2648 (5.86%)
mar : 477 (1.06%)
apr : 2931 (6.48%)
sep : 579 (1.28%)
=========================================================

======================='Target'==================
0 : 39911 (88.3%)
1 : 5286 (11.7%)
=========================================================
# Ordinal encodings: months in calendar order, education by level.
replace_struct = {
    "month" : {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6, "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12},
    "contact" : {"unknown": -1, "telephone": 1, "cellular": 2},
    "education" : {"primary": 1, "secondary": 2, "tertiary": 3},
}
main_data = main_data.replace(replace_struct)
main_data.head(10)
| | age | job | marital | education | default | housing | loan | contact | day | month | campaign | pdays | previous | Target | duration_log | balance_Log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | 3 | no | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 5.568345 | 7.670429 |
| 1 | 44 | technician | single | 2 | no | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 5.023881 | 3.401197 |
| 2 | 33 | entrepreneur | married | 2 | no | yes | yes | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 4.343805 | 1.098612 |
| 3 | 47 | blue-collar | married | 2 | no | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 4.532599 | 7.317876 |
| 4 | 33 | blue-collar | single | 2 | no | no | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 5.293305 | 0.693147 |
| 5 | 35 | management | married | 3 | no | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 4.941642 | 5.446737 |
| 6 | 28 | management | single | 3 | no | yes | yes | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 5.384495 | 6.104793 |
| 7 | 42 | entrepreneur | divorced | 3 | yes | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 5.942799 | 1.098612 |
| 8 | 58 | retired | married | 1 | no | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 3.931826 | 4.804021 |
| 9 | 43 | technician | single | 2 | no | yes | no | -1 | 5 | 5 | 1 | -1 | 0 | 0 | 4.025352 | 6.386879 |
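The ordinal encoding step can be sanity-checked on a small frame before it is applied to the full dataset. The sketch below is an illustration on made-up rows, not the notebook's own code: it builds the month codes from a calendar-ordered list (so the ordering is easy to audit) and uses `Series.map`, which leaves any unmapped label as `NaN`. The names `toy`, `month_order`, `month_codes`, `education_codes`, `encoded`, and `unmapped` are assumptions introduced for this example.

```python
import pandas as pd

# Toy rows standing in for the real `month` and `education` columns (hypothetical data).
toy = pd.DataFrame({
    "month": ["jan", "mar", "apr", "dec"],
    "education": ["primary", "tertiary", "secondary", "primary"],
})

# Build month codes from a calendar-ordered list so the ordering cannot drift.
month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
month_codes = {name: i for i, name in enumerate(month_order, start=1)}
education_codes = {"primary": 1, "secondary": 2, "tertiary": 3}

encoded = toy.assign(
    month=toy["month"].map(month_codes),
    education=toy["education"].map(education_codes),
)

# Series.map turns unmapped labels into NaN, so a null count flags mapping typos.
unmapped = int(encoded.isna().sum().sum())
print(encoded)
print("unmapped values:", unmapped)
```

A non-zero `unmapped` count would indicate a category that the mapping dictionaries do not cover.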
oneHotCols = ["housing", "default", "marital", "job", "loan"]
main_data = pd.get_dummies(main_data, columns=oneHotCols)
info(main_data)
45197 Rows x 31 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45197 | 0.00% | int64 |
| age | 0 | 45197 | 0.00% | int64 |
| balance_Log | 0 | 45197 | 0.00% | float64 |
| campaign | 0 | 45197 | 0.00% | int64 |
| contact | 0 | 45197 | 0.00% | int64 |
| day | 0 | 45197 | 0.00% | int64 |
| default_no | 0 | 45197 | 0.00% | uint8 |
| default_yes | 0 | 45197 | 0.00% | uint8 |
| duration_log | 0 | 45197 | 0.00% | float64 |
| education | 0 | 45197 | 0.00% | int64 |
| housing_no | 0 | 45197 | 0.00% | uint8 |
| housing_yes | 0 | 45197 | 0.00% | uint8 |
| job_admin. | 0 | 45197 | 0.00% | uint8 |
| job_blue-collar | 0 | 45197 | 0.00% | uint8 |
| job_entrepreneur | 0 | 45197 | 0.00% | uint8 |
| job_housemaid | 0 | 45197 | 0.00% | uint8 |
| job_management | 0 | 45197 | 0.00% | uint8 |
| job_retired | 0 | 45197 | 0.00% | uint8 |
| job_self-employed | 0 | 45197 | 0.00% | uint8 |
| job_services | 0 | 45197 | 0.00% | uint8 |
| job_student | 0 | 45197 | 0.00% | uint8 |
| job_technician | 0 | 45197 | 0.00% | uint8 |
| job_unemployed | 0 | 45197 | 0.00% | uint8 |
| loan_no | 0 | 45197 | 0.00% | uint8 |
| loan_yes | 0 | 45197 | 0.00% | uint8 |
| marital_divorced | 0 | 45197 | 0.00% | uint8 |
| marital_married | 0 | 45197 | 0.00% | uint8 |
| marital_single | 0 | 45197 | 0.00% | uint8 |
| month | 0 | 45197 | 0.00% | int64 |
| pdays | 0 | 45197 | 0.00% | int64 |
| previous | 0 | 45197 | 0.00% | int64 |
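One-hot encoding a binary column such as `housing` yields two perfectly complementary indicator columns (`housing_no`, `housing_yes`). Tree-based models tolerate this redundancy, but it may matter for linear models. As a hedged illustration on made-up rows (the frame `toy` and the names `full` and `compact` are assumptions for this example), `pd.get_dummies` accepts `drop_first=True` to keep one fewer indicator per encoded column:

```python
import pandas as pd

# Toy rows standing in for two of the categorical columns (hypothetical data).
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "marital": ["married", "single", "divorced"],
})

# Default behaviour: one indicator column per category.
full = pd.get_dummies(toy, columns=["housing", "marital"])

# drop_first=True removes the first category of each encoded column,
# so a binary column becomes a single indicator.
compact = pd.get_dummies(toy, columns=["housing", "marital"], drop_first=True)

print(list(full.columns))
print(list(compact.columns))
```

Whether to drop the first level is a modelling choice; the notebook keeps all indicator columns.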
fig, ax = plt.subplots(figsize=(30,30))
sns.heatmap(data=main_data.corr(), annot=True, linewidths=.5, ax=ax)
plt.show()
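# Compute the correlation matrix once, then drop features whose absolute
# correlation with the target is below 0.001.
target_corr = main_data.corr()[TARGET_COLUMN]
list_of_features_to_drop = target_corr[target_corr.abs() < 0.001].index.to_list()
main_data.drop(list_of_features_to_drop, axis=1, inplace=True)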
info(main_data)
45197 Rows x 30 Columns
| | Nulls | Non-Nulls | Nulls % | Type |
|---|---|---|---|---|
| Target | 0 | 45197 | 0.00% | int64 |
| age | 0 | 45197 | 0.00% | int64 |
| balance_Log | 0 | 45197 | 0.00% | float64 |
| campaign | 0 | 45197 | 0.00% | int64 |
| contact | 0 | 45197 | 0.00% | int64 |
| day | 0 | 45197 | 0.00% | int64 |
| default_no | 0 | 45197 | 0.00% | uint8 |
| default_yes | 0 | 45197 | 0.00% | uint8 |
| duration_log | 0 | 45197 | 0.00% | float64 |
| education | 0 | 45197 | 0.00% | int64 |
| housing_no | 0 | 45197 | 0.00% | uint8 |
| housing_yes | 0 | 45197 | 0.00% | uint8 |
| job_admin. | 0 | 45197 | 0.00% | uint8 |
| job_blue-collar | 0 | 45197 | 0.00% | uint8 |
| job_entrepreneur | 0 | 45197 | 0.00% | uint8 |
| job_housemaid | 0 | 45197 | 0.00% | uint8 |
| job_management | 0 | 45197 | 0.00% | uint8 |
| job_retired | 0 | 45197 | 0.00% | uint8 |
| job_services | 0 | 45197 | 0.00% | uint8 |
| job_student | 0 | 45197 | 0.00% | uint8 |
| job_technician | 0 | 45197 | 0.00% | uint8 |
| job_unemployed | 0 | 45197 | 0.00% | uint8 |
| loan_no | 0 | 45197 | 0.00% | uint8 |
| loan_yes | 0 | 45197 | 0.00% | uint8 |
| marital_divorced | 0 | 45197 | 0.00% | uint8 |
| marital_married | 0 | 45197 | 0.00% | uint8 |
| marital_single | 0 | 45197 | 0.00% | uint8 |
| month | 0 | 45197 | 0.00% | int64 |
| pdays | 0 | 45197 | 0.00% | int64 |
| previous | 0 | 45197 | 0.00% | int64 |
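The correlation filter removed a single column here (31 columns before, 30 after). A minimal sketch of the same idea on made-up data follows; the frame `toy` and its columns `signal` and `noise` are assumptions chosen so that `noise` has zero correlation with the target. It illustrates the filtering logic only and says nothing about which threshold suits the real data.

```python
import pandas as pd

# Toy frame: `signal` tracks the target, `noise` is uncorrelated with it by construction.
toy = pd.DataFrame({
    "Target": [0, 1, 0, 1],
    "signal": [0.1, 0.9, 0.2, 0.8],
    "noise": [1, 1, -1, -1],
})

# Correlation of every feature with the target (the target itself is excluded).
target_corr = toy.corr()["Target"].drop("Target")

# Features whose absolute correlation falls below the threshold are dropped.
weak_features = target_corr[target_corr.abs() < 0.001].index.to_list()
reduced = toy.drop(columns=weak_features)

print(target_corr)
print("dropped:", weak_features)
```

Pearson correlation only captures linear association, so a feature dropped this way could still be useful to a tree-based model.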
def GetMetricsScore(model, x_train_arg, x_test_arg, y_train_arg, y_test_arg):
    '''
    Description:
        Given a fitted model, computes its performance on the training and test data.
        Metrics computed: accuracy, recall, precision, and F1 score.
    Input:
        model - The fitted learning model.
        x_train_arg, x_test_arg - Training and test features.
        y_train_arg, y_test_arg - Training and test labels.
    Returns:
        A dictionary containing the model's performance metrics.
    '''
    pred_train = model.predict(x_train_arg)
    pred_test = model.predict(x_test_arg)
    train_accuracy = model.score(x_train_arg, y_train_arg)
    test_accuracy = model.score(x_test_arg, y_test_arg)
    train_recall = metrics.recall_score(y_train_arg, pred_train)
    test_recall = metrics.recall_score(y_test_arg, pred_test)
    train_precision = metrics.precision_score(y_train_arg, pred_train)
    test_precision = metrics.precision_score(y_test_arg, pred_test)
    # F1 is the harmonic mean of precision and recall.
    f1_score_train = 2 * ((train_precision * train_recall)/(train_precision + train_recall))
    f1_score_test = 2 * ((test_precision * test_recall)/(test_precision + test_recall))
    return {
        'Accuracy_Test' : test_accuracy,
        'Accuracy_Train' : train_accuracy,
        'Recall_Test' : test_recall,
        'Recall_Train' : train_recall,
        'Precision_Test' : test_precision,
        'Precision_Train' : train_precision,
        'F1_Score_Train' : f1_score_train,
        'F1_Score_Test' : f1_score_test
    }
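The F1 score in `GetMetricsScore` is computed by hand as the harmonic mean of precision and recall. As a hedged cross-check on made-up labels (the lists `y_true`, `y_pred`, and `all_negative` are assumptions for this example), the sketch below shows that the manual formula agrees with `metrics.f1_score`, and that the library function accepts `zero_division=0` for the edge case where a model predicts no positives at all:

```python
import sklearn.metrics as metrics

# Toy labels: 3 actual positives, of which 2 are found, plus 1 false positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)

# Manual harmonic mean, as in GetMetricsScore.
manual_f1 = 2 * precision * recall / (precision + recall)
# Library equivalent.
library_f1 = metrics.f1_score(y_true, y_pred)

# Edge case: no positive predictions, so precision and recall are both 0 and the
# manual formula would divide by zero. zero_division=0 makes f1_score return 0.
all_negative = [0] * 8
safe_f1 = metrics.f1_score(y_true, all_negative, zero_division=0)

print(manual_f1, library_f1, safe_f1)
```

With a class balance of roughly 88% to 12%, a weak model predicting only the majority class is plausible, which is where the edge case would arise.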
def DisplayConfusionMatrix(model, y_actual, model_name, labels=(0, 1)):
    # Predictions are made on the global x_test, so y_actual must be the matching y_test.
    y_predict = model.predict(x_test)
    confusion_matrix = metrics.confusion_matrix(y_actual, y_predict, labels=list(labels))
    confusion_matrix_df = pd.DataFrame(
        confusion_matrix,
        index=["Actual No", "Actual Yes"],
        columns=["Predicted - No", "Predicted - Yes"])
    # Annotate each cell with its count and its share of all test samples.
    group_counts = ["{0:0.0f}".format(value) for value in confusion_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in confusion_matrix.flatten()/np.sum(confusion_matrix)]
    annot_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot_labels = np.array(annot_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(confusion_matrix_df, annot=annot_labels, fmt='')
    plt.ylabel("True Label")
    plt.xlabel("Predicted Label")
    plt.title(model_name)
def GetModelsScoreDataFrame(models, x_train_var, x_test_var, y_train_var, y_test_var):
    # Score each model once; one row per model, one column per metric.
    model_names = []
    score_rows = []
    for model_name, model in models.items():
        print(f"{model_name}")
        model_names.append(model_name)
        score_rows.append(GetMetricsScore(model, x_train_var, x_test_var, y_train_var, y_test_var))
    scores_overview_df = pd.DataFrame(score_rows, index=model_names)
    return scores_overview_df
def ConfusionMatrixBulkPlot(models, y):
    # Plot one confusion matrix per model.
    for clf_model_name in models:
        clf_model = models[clf_model_name]
        DisplayConfusionMatrix(model=clf_model, y_actual=y, model_name=clf_model_name)
def DisplayImportanceChart(model):
    # Feature names are read from the global feature frame X.
    importances = model.feature_importances_
    indices = np.argsort(importances)
    plt.figure(figsize=(12, 12))
    plt.title('Feature Importance')
    plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
    plt.yticks(range(len(indices)), [list(X.columns)[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()
X = main_data.drop(TARGET_COLUMN, axis=1)
Y = main_data.pop(TARGET_COLUMN)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=1, stratify=Y)
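Because only about 11.7% of customers subscribed, the `stratify=Y` argument matters: it keeps the positive rate the same in both halves of the split. The sketch below illustrates this on synthetic labels; `y_toy`, `X_toy`, and the 12% positive rate are assumptions for the example, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 12% positive rate (120 of 1000), similar in spirit to the target.
y_toy = np.array([1] * 120 + [0] * 880)
X_toy = np.arange(1000).reshape(-1, 1)

# Same split settings as the notebook: half the data held out, stratified on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=1, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without stratification the two rates can drift apart by chance, which makes recall on the test set harder to compare with recall on the training set.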
def BuildAndGetBaggingModels(x_var, y_var):
    # Fit three baseline models with default hyperparameters.
    bagging_estimator = BaggingClassifier(random_state=1)
    bagging_estimator.fit(x_var, y_var)
    pprint.pprint(bagging_estimator)
    rf_estimator = RandomForestClassifier(random_state=1)
    rf_estimator.fit(x_var, y_var)
    pprint.pprint(rf_estimator)
    dtree1 = DecisionTreeClassifier(random_state=1)
    dtree1.fit(x_var, y_var)
    pprint.pprint(dtree1)
    models = {
        'Bagging Classifier' : bagging_estimator,
        'RandomForest Model' : rf_estimator,
        'Decision Tree' : dtree1
    }
    return models
models = BuildAndGetBaggingModels(x_train, y_train)
pprint.pprint(models)
BaggingClassifier(random_state=1)
RandomForestClassifier(random_state=1)
DecisionTreeClassifier(random_state=1)
{'Bagging Classifier': BaggingClassifier(random_state=1),
'Decision Tree': DecisionTreeClassifier(random_state=1),
'RandomForest Model': RandomForestClassifier(random_state=1)}
results = GetModelsScoreDataFrame(models, x_train, x_test, y_train, y_test)
results.head()
Bagging Classifier
RandomForest Model
Decision Tree
| | Accuracy_Test | Accuracy_Train | Recall_Test | Recall_Train | Precision_Test | Precision_Train | F1_Score_Train | F1_Score_Test |
|---|---|---|---|---|---|---|---|---|
| Bagging Classifier | 0.895482 | 0.991902 | 0.373439 | 0.938328 | 0.582989 | 0.992 | 0.964418 | 0.455258 |
| RandomForest Model | 0.900704 | 0.999956 | 0.366250 | 0.999622 | 0.629798 | 1.000 | 0.999811 | 0.463158 |
| Decision Tree | 0.870923 | 1.000000 | 0.458948 | 1.000000 | 0.449259 | 1.000 | 1.000000 | 0.454052 |
ConfusionMatrixBulkPlot(models, y_test)
DisplayImportanceChart(models['RandomForest Model'])
DisplayImportanceChart(models['Decision Tree'])
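bagging_estimator = BaggingClassifier(random_state=1)
# Note: for max_samples and max_features, the integer 1 means "exactly one
# sample/feature"; the float 1.0 would mean 100%.
parameters = {
    'max_samples' : [0.7, 0.8, 0.9, 1],
    'max_features' : [0.7, 0.8, 0.9, 1],
    'n_estimators' : [10, 20, 30, 40, 50]
}
# Tune for recall on the positive class.
recall_scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(bagging_estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
bagging_estimator = grid_obj.best_estimator_
# best_estimator_ is already refit on the training data (refit=True by default);
# fitting again here also displays the selected parameters.
bagging_estimator.fit(x_train, y_train)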
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=20,
random_state=1)
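One detail of the bagging search grid is easy to miss: `BaggingClassifier` interprets an integer `max_samples` as an absolute count and a float as a fraction, so `1` and `1.0` are very different settings. The sketch below illustrates this on synthetic data (`X_toy`, `y_toy`, and the two classifier names are assumptions for the example) by inspecting `estimators_samples_`, which holds the sample indices drawn for each base estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data standing in for the training set (hypothetical).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=1)

# Integer 1: each base estimator is trained on exactly one drawn sample.
one_sample = BaggingClassifier(n_estimators=5, max_samples=1, random_state=1).fit(X_toy, y_toy)

# Float 1.0: each base estimator draws as many samples as the training set contains.
all_samples = BaggingClassifier(n_estimators=5, max_samples=1.0, random_state=1).fit(X_toy, y_toy)

n_drawn_int = len(one_sample.estimators_samples_[0])
n_drawn_float = len(all_samples.estimators_samples_[0])
print(n_drawn_int, n_drawn_float)
```

The same integer-versus-float convention applies to `max_features`, so a grid meant to include "use everything" would list `1.0` rather than `1`.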
dtree1 = DecisionTreeClassifier(random_state=1)
parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(3, 15)
}
# Tune for recall on the positive class.
recall_scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(dtree1, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
dtree1 = grid_obj.best_estimator_
dtree1.fit(x_train, y_train)
DisplayImportanceChart(dtree1)
randomForest_model = RandomForestClassifier(random_state=1)
parameters = {
    # Weight the minority (subscribed) class more heavily.
    'class_weight' : [{0: 0.3, 1: 0.7}],
    'min_samples_leaf' : np.arange(5, 7),
    'max_features' : np.arange(0.2, 0.4, 0.1),
    'max_samples' : np.arange(0.3, 0.5, 0.1),
    "n_estimators" : [50, 100, 150]
}
# Tune for recall on the positive class.
recall_scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(randomForest_model, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
randomForest_model = grid_obj.best_estimator_
randomForest_model.fit(x_train, y_train)
DisplayImportanceChart(randomForest_model)
models = {
'Bagging Classifier' : bagging_estimator,
'RandomForest Model' : randomForest_model,
'Decision Tree' : dtree1
}
results = GetModelsScoreDataFrame(models, x_train, x_test, y_train, y_test)
results.head()
Bagging Classifier
RandomForest Model
Decision Tree
| | Accuracy_Test | Accuracy_Train | Recall_Test | Recall_Train | Precision_Test | Precision_Train | F1_Score_Train | F1_Score_Test |
|---|---|---|---|---|---|---|---|---|
| Bagging Classifier | 0.898889 | 0.995486 | 0.408248 | 0.964056 | 0.599444 | 0.997260 | 0.980377 | 0.485708 |
| RandomForest Model | 0.901279 | 0.922737 | 0.584941 | 0.690125 | 0.576866 | 0.663032 | 0.676307 | 0.580875 |
| Decision Tree | 0.884774 | 0.943402 | 0.433220 | 0.658721 | 0.508663 | 0.822002 | 0.731359 | 0.467920 |
ConfusionMatrixBulkPlot(models, y_test)
def BuildAndGetBoostingModels(x_var, y_var):
    # Fit two baseline boosting models with default hyperparameters.
    adaBoosting_Model = AdaBoostClassifier(random_state=1)
    adaBoosting_Model.fit(x_var, y_var)
    pprint.pprint(adaBoosting_Model)
    gradientboost_model = GradientBoostingClassifier(random_state=1)
    gradientboost_model.fit(x_var, y_var)
    pprint.pprint(gradientboost_model)
    models = {
        'Gradient Boost' : gradientboost_model,
        'AdaBoostClassifier' : adaBoosting_Model,
    }
    return models
models = BuildAndGetBoostingModels(x_train, y_train)
pprint.pprint(models)
AdaBoostClassifier(random_state=1)
GradientBoostingClassifier(random_state=1)
{'AdaBoostClassifier': AdaBoostClassifier(random_state=1),
'Gradient Boost': GradientBoostingClassifier(random_state=1)}
results = GetModelsScoreDataFrame(models, x_train, x_test, y_train, y_test)
results.head()
Gradient Boost
AdaBoostClassifier
| | Accuracy_Test | Accuracy_Train | Recall_Test | Recall_Train | Precision_Test | Precision_Train | F1_Score_Train | F1_Score_Test |
|---|---|---|---|---|---|---|---|---|
| Gradient Boost | 0.902341 | 0.909284 | 0.387817 | 0.417329 | 0.635068 | 0.683819 | 0.518327 | 0.481560 |
| AdaBoostClassifier | 0.893447 | 0.892778 | 0.323118 | 0.323874 | 0.579769 | 0.573727 | 0.414027 | 0.414966 |
ConfusionMatrixBulkPlot(models, y_test)
adaBoosting_Model = AdaBoostClassifier(random_state=1)
parameters = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    # In scikit-learn 1.2 and later this parameter is named "estimator".
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Tune for recall on the positive class.
recall_scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(adaBoosting_Model, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
adaBoosting_Model = grid_obj.best_estimator_
adaBoosting_Model.fit(x_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1, n_estimators=90, random_state=1)
DisplayImportanceChart(adaBoosting_Model)
gradientboost_model = GradientBoostingClassifier(random_state=1)
# Note: for max_features, the integer 1 means "exactly one feature";
# the float 1.0 would mean all features.
parameters = {
    "subsample" : [0.8, 0.9, 1],
    "n_estimators" : [100, 150, 250],
    "max_features" : [0.7, 0.8, 0.9, 1]
}
# Tune for recall on the positive class.
recall_scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(gradientboost_model, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
gradientboost_model = grid_obj.best_estimator_
gradientboost_model.fit(x_train, y_train)
GradientBoostingClassifier(max_features=0.7, n_estimators=250, random_state=1,
subsample=1)
DisplayImportanceChart(gradientboost_model)
estimators = [('Random Forest',randomForest_model), ('Gradient Boosting',gradientboost_model), ('Decision Tree',dtree1)]
stacking_model = StackingClassifier(estimators= estimators , final_estimator=DecisionTreeClassifier())
stacking_model.fit(x_train,y_train)
StackingClassifier(estimators=[('Random Forest',
RandomForestClassifier(class_weight={0: 0.3,
1: 0.7},
max_features=0.30000000000000004,
max_samples=0.4,
min_samples_leaf=6,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(max_features=0.7,
n_estimators=250,
random_state=1,
subsample=1)),
('Decision Tree',
DecisionTreeClassifier(criterion='entropy',
max_depth=12,
random_state=1))],
final_estimator=DecisionTreeClassifier())
models = {
'Gradient Boost' : gradientboost_model,
'AdaBoostClassifier' : adaBoosting_Model,
'Stacking' : stacking_model
}
results = GetModelsScoreDataFrame(models, x_train, x_test, y_train, y_test)
results.head()
Gradient Boost
AdaBoostClassifier
Stacking
| | Accuracy_Test | Accuracy_Train | Recall_Test | Recall_Train | Precision_Test | Precision_Train | F1_Score_Train | F1_Score_Test |
|---|---|---|---|---|---|---|---|---|
| Gradient Boost | 0.904288 | 0.916409 | 0.413167 | 0.468407 | 0.640845 | 0.718931 | 0.567239 | 0.502415 |
| AdaBoostClassifier | 0.898668 | 0.927383 | 0.457813 | 0.603481 | 0.585389 | 0.728976 | 0.660319 | 0.513800 |
| Stacking | 0.868888 | 0.879193 | 0.463110 | 0.503216 | 0.442197 | 0.484165 | 0.493506 | 0.452412 |
ConfusionMatrixBulkPlot(models, y_test)